-
R is the preferred language for data analytics due to its open-source development and high extensibility. Exponential growth in data has caused longer processing times, leading to the rise of parallel computing technologies for analysis. However, using R together with high-performance computing resources is a cumbersome task. This paper proposes a framework that gives users access to high-performance computing resources and simplifies configuration, programming, data upload, and job scheduling through a web user interface. In addition, it provides two modes of parallelizing data-intensive computing tasks, catering to a wide range of users. The case studies emphasize the utility and efficiency of the framework, which offers better performance, ease of use, and high scalability.
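The abstract does not describe the two parallelization modes in detail. Purely as an illustrative sketch, a single-node multicore mode for chunked R tasks might look like the following; the summarize_chunk function, chunk counts, and core counts are hypothetical placeholders, not part of the framework:

```r
# Minimal single-node sketch (illustrative only, not the paper's implementation):
# split a large data frame into chunks and process them on multiple cores.
library(parallel)

# Hypothetical per-chunk analysis supplied by the user.
summarize_chunk <- function(chunk) {
  data.frame(n = nrow(chunk), mean_value = mean(chunk$value))
}

analyze_in_parallel <- function(df, n_chunks = 8L, n_cores = 4L) {
  chunks  <- split(df, cut(seq_len(nrow(df)), n_chunks, labels = FALSE))
  results <- mclapply(chunks, summarize_chunk, mc.cores = n_cores)
  do.call(rbind, results)  # combine per-chunk results
}

# Example usage with synthetic data:
df <- data.frame(value = rnorm(1e6))
print(analyze_in_parallel(df))
```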
-
With the increase in data-driven analytics, the demand for high-performance computing resources has risen. Many high-performance computing centers provide cyberinfrastructure (CI) for academic research. However, access barriers prevent these resources from reaching a broad range of users: users who are new to the data analytics field are not yet equipped to take advantage of the tools offered by CI. In this paper, we propose a framework that lowers these access barriers and brings high-performance computing resources to users who do not have the training to utilize the capabilities of CI. The framework uses the divide-and-conquer (DC) paradigm for data-intensive computing tasks. It consists of three major components: a user interface (UI), a parallel scripts generator (PSG), and the underlying cyberinfrastructure (CI). The goal of the framework is to provide a user-friendly method for parallelizing data-intensive computing tasks with minimal user intervention. Some of the key design goals are usability, scalability, and reproducibility. Users can focus on their problem and leave the parallelization details to the framework.
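As a hedged illustration of the divide-and-conquer paradigm the framework builds on, a generated script might follow a structure like the sketch below; the divide/conquer/combine function names, file layout, and the trivial per-chunk analysis are assumptions for demonstration, not the PSG's actual output:

```r
# Illustrative divide-and-conquer skeleton (assumed structure, not the PSG's real output).

# Divide: write the input into per-task chunk files that independent jobs can consume.
divide <- function(input_csv, n_tasks, out_dir = "chunks") {
  dir.create(out_dir, showWarnings = FALSE)
  df   <- read.csv(input_csv)
  task <- cut(seq_len(nrow(df)), n_tasks, labels = FALSE)
  for (i in seq_len(n_tasks)) {
    write.csv(df[task == i, ],
              file.path(out_dir, sprintf("chunk_%03d.csv", i)),
              row.names = FALSE)
  }
}

# Conquer: each job runs the user's analysis on one chunk and saves a partial result.
conquer <- function(chunk_csv, result_dir = "results") {
  dir.create(result_dir, showWarnings = FALSE)
  chunk <- read.csv(chunk_csv)
  res   <- data.frame(rows = nrow(chunk))   # user-supplied analysis goes here
  saveRDS(res, file.path(result_dir, paste0(basename(chunk_csv), ".rds")))
}

# Combine: aggregate all partial results once every job has finished.
combine <- function(result_dir = "results") {
  parts <- lapply(list.files(result_dir, full.names = TRUE), readRDS)
  do.call(rbind, parts)
}
```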
-
As the volume of data and the technical complexity of large-scale analysis increase, many domain experts desire powerful computation and a familiar analysis interface so they can fully participate in the analysis workflow by focusing on individual datasets and leaving the large-scale computation to the system. Towards this goal, we investigate and benchmark a family of Divide-and-Conquer strategies that help domain experts perform large-scale simulations by scaling up their analysis code written in R, the most popular data science and interactive analysis language. We implement Divide-and-Conquer strategies that use R as the analysis (and computing) language, allowing advanced users to provide custom R scripts and variables that are fully embedded into the large-scale analysis workflow. The workflow divides large-scale simulation tasks and conquers them with Slurm array jobs and R; simulations and final aggregations are scheduled as parallel array jobs to accelerate the knowledge discovery process. The objective is to provide a new analytics workflow for large-scale analysis loops in which expert users only need to focus on the Divide-and-Conquer tasks using their domain knowledge.
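Since the abstract names Slurm array jobs as the scheduling mechanism, a minimal sketch of a per-task R worker in that spirit is shown below; the file paths, the toy simulation, and the sbatch invocation in the comment are illustrative assumptions rather than the authors' actual scripts:

```r
# Illustrative per-task worker in the spirit of the described workflow (not the authors' code).
# Each Slurm array task runs one simulation replicate and writes its own result file;
# a separate aggregation job combines the results after all tasks complete.
# Submitted, for example, via: sbatch --array=1-100 run_task.sh
# (where run_task.sh is assumed to call Rscript on this file).

task_id <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID", unset = "1"))

dir.create("results", showWarnings = FALSE)

set.seed(task_id)                        # reproducible per-task randomness
sim <- replicate(1000, mean(rnorm(50)))  # stand-in for the user's simulation code

saveRDS(data.frame(task = task_id, estimate = mean(sim)),
        sprintf("results/task_%04d.rds", task_id))

# Aggregation step (run as a single follow-up job once the array finishes):
# results <- do.call(rbind, lapply(list.files("results", full.names = TRUE), readRDS))
```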